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Abstract 



The folding ability of a heteropolymer model for proteins subject to Monte Carlo dynamics on a simple cubic lattice 
is shown to be strongly correlated with the energy gap between the native state and the structurally dissimilar part 
of the spectrum. We consider a number of estimates of the energy gap that can be determined without simulation, 
including the gap in energy between the native and first excited fully compact states for sequences with fully compact 
native states. These estimates are found to be more robust predictors of folding ability than a parameter a that requires 
simulation for its evaluation: a = 1 — Tf/Tg, where Tf is the temperature at which the fluctuation of the order 
parameter is at its maximum and Tg is the temperature at which the specific heat is at its maximum. We show that the 
interpretation of Tg as the collapse transition temperature is not correct in general and that the correlation between a 
and the folding ability arises from the fact that a is essentially a measure of the energy gap. 



To function as a protein, a polypeptide must possesses a unique native state that is stable at a physiological tem- 
perature and be able to find that state in a reasonable time (milliseconds to minutes) at that temperature in spite of the 
fact that the number of possible configurations prevents it from making an exhaustive search (Levinthal paradox) |jl|]. 
For understanding the mechanism of folding and for protein design, it is important to determine the properties that 
distinguish polypeptide sequences that satisfy these requirements from those that do not. One approach is to compare 
sequence attributes with folding ability. Analytical heteropolymer theory based on a random energy model suggested 
that a large gap between the ground (native) state and the states (folds) that are structurally dissimilar is sufficient 
to allow a sequence to adopt a unique, stable structure Sali, Shakhnovich, and Karplus (SSK) demonstrated 

computationally that, for a 27-mer random heteropolymer model of a protein subject to Monte Carlo (MC) dynamics 
on a simple cubic lattice, a large energy gap promotes fast folding as well [^^. The energy gap criterion has been 
confirmed by similar studies by other authors and has been used in the design of fast folding sequences [^. 
Moreover, SSK showed that, when the ground state is maximally compact, the gap criterion can be simplified to a 
consideration of the difference in energy between the ground state and the first fully compact (3x3x3) excited 
state 

Based on results obtained with a set of 15-mer and 27-mer sequences, Klimov and Thirumalai (KT) argued that 
folding ability is not determined by the energy gap but instead by the parameter a ~ 1 — Tf/Tg, where Tj is the 
"folding" transition temperature (defined as the maximum in the fluctuation of the order parameter) and Tg is the 
"collapse" transition temperature (defined as the maximum in the specific heat) KT found that a correlates 

positively with the folding time (small a yields fast folding). In the present paper, we show that the maximum in the 
specific heat (Tg) is not the temperature of the collapse transition in general and that a is essentially a measure of the 
energy gap. In light of these considerations, we re-examine the results for the sequences studied by KT. We find that 
other measures of the energy gap and native state stability, such as the Z-score, correlate more strongly with folding 
ability than does the parameter a. 

The specific model is a self-avoiding heteropolymer chain on a three-dimensional simple cubic lattice. The energy 
function is the sum over all contacts (non-bonded spatial nearest-neighbors): 

E^Y. ._B^jAir,j-a) (1) 

where = Ir; ^ r^ l is the distance between monomers i and j located at and Vj, respectively, and a = 1 is the 
lattice spacing. The function A{rij — a) selects the interacting monomer pairs; A(0) = 1 and otherwise. The Bij 
give the specific interaction energies between monomers i and j; a complete set of Bij defines a sequence. The Bij 
are chosen randomly with Gaussian probability distribution: 
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The quantity Bq (Bq < 0) is a mean attraction between monomers that corresponds to an overall hydrophobic term. 
The 200 27-mer sequences studied in |^] were generated with Bq — —2.0 because that value guaranteed that most 
sequences had maximally compact ground states which could be found by examining the enumerated ensemble of 
103,346 maximally compact structures [|l2|. In the present study, we use the same sequences as in [|l^|ll]]; there are 
nine 15-mers with Bq = —2.0, thirty-two 15-mers with Bq ~ —0.1, two 27-mers with Bq = —2.0, and fifteen 27- 
mers with Bo — —0.1. KT included the value Bq = —0.1 because they felt that it produced a more realistic fraction 
of hydrophobic interactions. 

It should be noted that some of the sequences (KT have not stated which ones) were optimized by a Monte Carlo 
procedure in sequence space JTl]]. This procedure differs from that used by others in that individual elements of 
the Bij matrix were interchanged, rather than entire rows and columns (the latter is equivalent to switching residue 
types in a linear sequence). This optimization procedure is free to make the native contacts the lowest energy ones, so 
that there is no frustration, and, because only odd \ i — j\ > 3 can form contacts on a simple cubic lattice, to change the 
distribution of the Bij that actually contribute to the energy function. The averages of the Bij with odd \i — j\ (Bij) 
differed substantially from Bq in the 58 sequences studied; —0.805 < Bij < 0.690 for sequences with Bq = —0.1 
and -2.413 < B,j < -1.708 for sequences with Bq -2.0. 

We follow KT and use as an order parameter the complement of native pairwise distances [fol p4|] : 
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where r° is the distance between i and j in the ground state, and we use as a measure of compactness the total 
number of contacts (C). Boltzmann weighted averages of these quantities were calculated exactly for the 15-mer (all 
93,250,730 self-avoiding conformations can be enumerated) and by Monte Carlo simulation for the 27-mer (see [ pi] ] 
for the protocol). Following KT, the "folding" transition temperature (Tf) is taken to be the temperature at which the 
fluctuation in the order parameter Ax{T) = (x^) — (x)^ has a maximum, and the "collapse" transition temperature 
(Te) is taken to be the temperature at which the heat capacity C„(r) = AE/T'^ = {{E'^) - (E)'^) /T^ hasamaximum 
(units are chosen to make the Boltzmann constant equal to 1). There is one 15-mer sequence with Bq ~ —0.1 that 
exhibited two peaks in Cv{T), three that exhibited shoulders with derivatives close to zero in Cv{T), and one that 
exhibited a shoulder with a derivative close to zero in Ax{T); in these cases, for consistency with KT, we calculated 
the corresponding transition temperature {Tg or Tf) by a weighted average of the two temperatures corresponding to 
the zero derivative points (see Eqn. 10 of [|ll]|). 

From Tf and Tg, we can calculate the parameter = 1 — Tf /Tg, which was argued by KT to determine folding 
ability [|lOi|ll|]. To better understand Tf, Tg, and cr, we examine the heat capacity {Cy), the fluctuation in the order 
parameter (Ax), and the fluctuation in the total number of contacts (AC) as functions of temperature for several 
representative 15-mer sequences (Fig. |l]). When Bo = —2.0, the collapse is determined by the overall attraction 
between monomers; Cv{T) reaches its maximum (T — 1.340 for sequence 91 and T — 1.605 for 95) at a much lower 
temperature than does AC{T) (T — 3.160 and T w 3.320, respectively). When Bq — —0.1, the collapse transition 
is driven by specific contacts so that the folding and collapse transitions are closer in T. In the case of sequence 62 
(cr = 0.006), the peak in (T = 0.860) is closer to the peak in Ax (T = 0.855) than to the peak in AC (T = 0.965). 
In the case of sequence 5 (cr = 0.621), the peak in (T — 0.580) is closer to the peak in AC (T — 0.850) than to the 
peak in Ax (T — 0.220), but there is a substantial shoulder in Cy at the temperature corresponding to the peak in Ax, 
which is not uncommon for sequences with larger a. Although Tf is clearly associated with the folding transition, Tg 
cannot be interpreted as the temperature of the collapse transition in general. 



It is apparent from the above that cr must have a different physical meaning than that given in [|1C|,|1 1[. To better 
understand cr, we consider a simple two-state model with free energy Fi ^ Ei — TSi (i = 1, 2). By substitution into 

Ax = (x^) - (x)^ wefind ^ ^ 

Ax^(Xi"X2) \ '''^^J\\,2 - (4) 
[l + exp(^i^)] 

Similarly, 



To determine the maxima (Tf and Tg, respectively), we take the derivates with respect to temperature and set them 
equal to zero; it is important to note that xi — X2 and Ei — E2 are temperature dependent. We solve the resulting 
equations by expanding around the true transition point [Tt — T{Fi = F2)]: 

'^^)^l-(i?i-i?2)^. (6) 

Substitution of Eqn. ^ into dAx/dT = and dC^/dT = and straightforward rearrangement yield 

Tg-Tf^l/iE,-E2)^. (7) 

In other words, Tg and Tf deviate from the true transition (Tt) in such a way that a measure of their difference 
(cr) is inversely proportional to the square of the energy gap. Although lattice models (particularly the short chains 
considered here) deviate from two-state behavior, the simple model suggests that cr is related to the energy gap. 

We now compare the correlation between a and folding ability to that obtained with other measures of the native 
state stability. These include the transition temperatures themselves (Tg and Tf) and the Z-score of the native state [^, 
which more directly measures the degree to which a native state is separated from the majority of states in the spectrum. 
For all sequences, we estimated the Z-score by Zb = {Eq — C^Bij )/<jb, where Eq is the energy of the native state, Co 
is the number of contacts in that state, is the average Bij and cr^ is the standard deviation of Bij . For the 15-mers, 
it was possible to compare this estimate to the exact Z-score: Ze — {Eq ~ E)/(Te, where E is the unweighted average 
energy and as is the standard deviation of energies. As expected, the two quantities are correlated (Tables ^ and 0). 
For the sequences that have fully compact ground states, the Z-score is closely related to the energy gap between the 
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ground and first excited compact states (Acs) [Kl- To compare with we include statistics for Acs for the nine 
27-mers with Bq — —0.1 that have 3x3x3 ground states. Inclusion of the six sequences that have non-compact 
ground states would be inappropriate, because Acs is unrelated to the Z-score in these cases (rA^^Zs = —0.306). 

The folding ability of a sequence was measured by the mean-first passage time for finding the ground state (MFPT). 
The MFPT was calculated from 100 Metropolis Monte Carlo trials with a move set identical to that used previously 
|^,^,|l^,[ri|]. Trials began in a random configuration and were allowed to proceed for up to 10^ steps. Following KT, 
the simulation temperature (Tg) was chosen to yield a constant value of the order parameter: Tg — T{{x) = 0.21). A 
similar temperature was used by SSK: ~T{X — 0.8), where X — 1 — and pi is the Boltzmann probability 

of conformation i [|]|. Since Tg is determined from an order parameter, it is closely related to Tf, and is thus another 
measure of the native stability. 

The logarithm of the MFPT is plotted against a, Zb, and Tg for the 15-mers (Fig. ||) and the 27-mers (Fig. H). 
Overall, Figs. 2^ andHa are in good agreement with Figs. 2 of and 1 1 of [[lo|]. Pearson linear correlation coeffi- 
cients (Tables |and |l|) are calculated separately for Bq = —0.1 and Bq = —2.0 since these represent different solvent 
conditions. Although there is a high correlation between a and MFPT for sequences with Bq — —0.1, the correlation 
is low for those with Bq = —2.0. In contrast, Zb and Tg both do well and are comparable for both values of Bq. For 
the 27-mers, the most sensitive measure of folding ability is clearly Tg; these results are in agreement with [^], where 
it was found that a temperature at which the order parameter had a particular value (Tx) yielded the best predictivity of 
folding ability. For the sequences with compact ground states, Acs correlates more strongly with folding ability than 
does any other measure, particularly a (Table ||). 

The correlation between Acs and folding ability is obscured in the studies by KT (Fig. 22 of because they 
include sequences with native states that are not fully compact, even though it is clearly stated in [|]|that doing so is 
inappropriate (see also above). Instead, they compare folding ability to the gap between the the native state and the 
first excited state from the complete conformational ensemble (A) (Fig. 20 of [|ll]]). Typically, the first non-compact 
excited state is a tail flip, and thus that gap need not be correlated with either Acs or the Z-score. Moreover, it should be 
noted that, even if the correct gap (Acs) is used with the appropriate sequences (those with fully compact native states), 
it is incorrect to "normalize" by the simulation temperature {Tg) as KT have done in Fig. 3 of [ p^ and Fig. 21 of JTl]]. 
Because Acs and Tg are closely related, they both increase as the magnitude of the Z-score increases. Consequently, 
their ratio, Acs/Tg, is not expected to correlate with stability or folding ability. The calculations presented here are 
completely consistent with those of KT [|lO[Q; it is only the interpretation that differs. 

In the study by KT and the present one, the simulation temperature differs for each sequence. This provides 
a physically meaningful approach since it is appropriate to compare the folding of sequences under conditions of 
corresponding stability; a protein sequence must not only be able to find its native state but it must be dominantly 
populated at equilibrium [^. It has been suggested that the correlation between folding ability and the energy gap 
found in and, by inference, here derives from the variation in simulation temperature [pj|]. To address this concern, 
we performed simulations for the 15-mers at three sets of constant temperatures T = 0.8, 1.0, and 1.2 (Table |l|). 
Although the correlations are somewhat reduced, they remain highly significant. In particular, the correlation between 
the logarithm of the MFPT and the exact Z-score (Ze), the most direct measure of the energy gap, is essentially 
unchanged. 

Thus, for the same 58 sequences as were studied in |jl0|, pT||, it is evident that, although a is well correlated with 
the folding ability for sequences with small Bq (Bq — —0.1), it is not for sequences with large Bq (Bq = —2.0). 
The difference between the two sets of sequences stems from a difference in the dependence of a on the Z-score (and 
hence Tg), which correlates well with folding ability for both values of Bq and different simulation temperatures; the 
relationship between a and the energy gap (Eqn. ^ breaks down for sequences with Bq = —2.0 because they deviate 
strongly from two-state behavior due to the large separation in temperature of the collapse and folding transitions. 
Moreover, Zb, unlike Tg and a, can be calculated without either explicit enumeration of all the conformations or 
Monte Carlo simulation. Consequently, Zb is not only more practical for protein design, but it is a true predictor of 
folding ability. 

The correlation of the folding rate with the energy gap can be understood in terms of its effect on the energy sur- 
face. For random 27-mer sequences, SSK found that folding proceeds by a fast collapse to a semi-compact random 
globule, followed by a slow, non-directed search through the (^ 10^) semi-compact structures for one of the (^ lO'^) 
transition states that lead rapidly to the native conformation A large energy gap results in a native-like transition 
state that is stable at a temperature high enough for the folding polypeptide chain to overcome barriers between random 
semi-compact states. As the energy gap increases to the levels obtainable in designed sequences, the model exhibits 
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Hammond behavior [|_4|1 in that the fraction of native contacts required in the transition state from which the chain 
folds rapidly to the native state decreases; random sequences with relatively small gaps must form 80% of the native 



contacts while designed sequences with large gaps need form only 20% 1 15 , 16|. This decrease in the number of 
native contacts in the transition state as the energy gap increases is consistent with the the behavior of homogeneous 
systems that coalesce into droplets. Strengthening the interparticle interactions accelerates the process by lowering the 
energy of the transition state and decreasing its size jl7| ] . 
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Table I: Pearson linear correlation coefficients (rxy) between measures of stability and the logarithm of the mean first 
passage time for simulations at T = Tg taken pairwise for 15-mers with Bo = —2.0 (above) and with Bo = —0.1 
(below), r^y = a^y/a^cTy = Y.7i^i ~ - {y))/V[l27i^i ~ {^))^Y.tiyt " iv))^]- A perfect correlation has 

Txy = 1, a perfect anti-correlation has r.j;y = —1. Three of the 15-mer sequences with Bq = —0.1 repeatedly did 
not succeed in folding in the allotted time, and so were excluded from the statistical analysis. These sequences fail to 
find the native state even though the number of MC steps is much larger than the number of possible conformations 
because the simulation temperature is low, so that most steps are rejected. 
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Table II: The same as Table | for 27-mers with Bq — —0.1. All (above) and those that have fully compact ground 
states (below). 
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Table III: Pearson linear correlation coefficients between measures of stability and the logarithm of the mean first 
passage time for 15-mers simulated with a single temperature. 
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Figure 2: 15-mer dependence of the mean-first passage time on a, T,, and Ze for sequences with = —0.1 (•), 
those with = —2.0 (o), and those that were excluded from the statistics (+) (see Table | caption). For the sequences 
excluded from the statistics, first passage times of 10^ were substituted for the trials which failed to find the native 
state. 
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Figure 3: Same as Fig. ^for 27-mers. 
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